To perform a cluster analysis in R, the data should generally be prepared as follows:

1. Rows are observations (individuals) and columns are variables.
2. Any missing value in the data must be removed or estimated.
3. The data must be standardized (i.e., scaled) to make variables comparable.

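The preparation steps above could look roughly like the following in base R. This is a minimal sketch on a made-up data frame; the object names (`raw`, `complete`, `prepared`) and values are assumptions for illustration, and dropping incomplete rows is only one of the two options mentioned (the other being estimation).

```r
# Toy data: rows are observations, columns are numeric variables.
# All names and values here are made up for illustration.
raw <- data.frame(
  var_a = c(1.2, 3.4, NA, 2.8, 5.1),
  var_b = c(100, 250, 180, NA, 320),
  row.names = paste0("obs", 1:5)
)

# Drop every row that contains a missing value.
complete <- na.omit(raw)

# Standardize each column to mean 0 and standard deviation 1.
prepared <- as.data.frame(scale(complete))

print(prepared)
```

Scaling matters here because k-means relies on Euclidean distances, so a variable measured in large units would otherwise dominate the result.
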
I group the data by medium & year to compute the mean value for each p_group:
| Medium | Bündnis 90/ Die Grüne | CDU/CSU | FDP | Linke/PDS/WASG | SPD |
|---|---|---|---|---|---|
| BamS | -0.084 | -0.023 | -0.037 | -0.171 | -0.081 |
| Bericht aus Berlin | -0.111 | -0.115 | -0.153 | -0.148 | -0.125 |
| Berlin direkt | -0.099 | -0.085 | -0.089 | -0.158 | -0.098 |
| Berliner | -0.065 | -0.088 | -0.062 | -0.082 | -0.077 |
| Bild | -0.105 | -0.048 | -0.043 | -0.161 | -0.085 |
| Die Welt | -0.101 | -0.047 | -0.039 | -0.124 | -0.096 |
| Die Woche | -0.107 | -0.12 | -0.078 | -0.046 | -0.06 |
| Die Zeit | -0.11 | -0.101 | -0.121 | -0.083 | -0.093 |
| F.A.S. | -0.062 | -0.04 | -0.059 | -0.115 | -0.08 |
| F.A.Z. | -0.077 | -0.041 | -0.034 | -0.076 | -0.074 |
| Fakt | -0.118 | -0.104 | -0.245 | -0.164 | -0.215 |
| Focus | -0.118 | -0.052 | -0.049 | -0.153 | -0.11 |
| Fr. Rundschau | -0.066 | -0.102 | -0.083 | -0.08 | -0.078 |
| Frontal 21 | -0.098 | -0.187 | -0.178 | -0.091 | -0.19 |
| heute | -0.049 | -0.07 | -0.071 | -0.077 | -0.062 |
| heute journal | -0.055 | -0.085 | -0.09 | -0.105 | -0.07 |
| Kontraste | -0.111 | -0.184 | -0.155 | -0.123 | -0.204 |
| Monitor | -0.158 | -0.238 | -0.25 | -0.172 | -0.148 |
| Panorama | -0.106 | -0.182 | -0.22 | -0.37 | -0.201 |
| Plusminus | -0.201 | -0.154 | -0.226 | 0.042 | -0.168 |
| ProSieben | -0.098 | -0.07 | -0.064 | -0.1 | -0.074 |
| Report (BR) | -0.139 | -0.133 | -0.117 | -0.208 | -0.233 |
| Report (SWR) | -0.086 | -0.223 | -0.348 | -0.18 | -0.172 |
| Rh. Merkur | -0.128 | -0.046 | -0.053 | -0.129 | -0.118 |
| RTL Aktuell | -0.075 | -0.072 | -0.082 | -0.075 | -0.064 |
| Sat.1 News | -0.094 | -0.058 | -0.04 | -0.124 | -0.078 |
| Spiegel | -0.066 | -0.089 | -0.098 | -0.092 | -0.064 |
| Stern | -0.064 | -0.088 | -0.063 | -0.026 | -0.087 |
| Super Illu | -0.208 | -0.051 | -0.096 | -0.062 | -0.135 |
| SZ | -0.063 | -0.088 | -0.066 | -0.086 | -0.077 |
| Tagesschau | -0.055 | -0.08 | -0.071 | -0.075 | -0.063 |
| Tagesthemen | -0.071 | -0.091 | -0.097 | -0.082 | -0.075 |
| tageszeitung | -0.071 | -0.124 | -0.098 | -0.091 | -0.09 |
| WamS | -0.097 | -0.035 | -0.041 | -0.122 | -0.119 |
| WISO | -0.134 | -0.074 | -0.107 | 0.026 | -0.086 |

| Medium | Bündnis 90/ Die Grüne | CDU/CSU | FDP | Linke/PDS/WASG | SPD |
|---|---|---|---|---|---|
| BamS | -0.008 | -0.011 | -0.005 | -0.003 | -0.023 |
| Bericht aus Berlin | -0.01 | -0.048 | -0.026 | -0.012 | -0.03 |
| Berlin direkt | -0.009 | -0.037 | -0.012 | -0.01 | -0.026 |
| Berliner | -0.012 | -0.03 | -0.005 | -0.005 | -0.025 |
| Bild | -0.009 | -0.022 | -0.005 | -0.005 | -0.028 |
| Die Welt | -0.013 | -0.019 | -0.004 | -0.005 | -0.031 |
| Die Woche | -0.021 | -0.042 | -0.006 | -0.003 | -0.019 |
| Die Zeit | -0.017 | -0.036 | -0.007 | -0.004 | -0.036 |
| F.A.S. | -0.008 | -0.016 | -0.005 | -0.005 | -0.027 |
| F.A.Z. | -0.01 | -0.016 | -0.003 | -0.004 | -0.024 |
| Fakt | -0.013 | -0.033 | -0.014 | -0.016 | -0.09 |
| Focus | -0.012 | -0.023 | -0.005 | -0.005 | -0.034 |
| Fr. Rundschau | -0.011 | -0.035 | -0.006 | -0.004 | -0.028 |
| Frontal 21 | -0.008 | -0.088 | -0.023 | -0.005 | -0.05 |
| heute | -0.006 | -0.03 | -0.008 | -0.003 | -0.019 |
| heute journal | -0.006 | -0.038 | -0.01 | -0.004 | -0.022 |
| Kontraste | -0.012 | -0.067 | -0.012 | -0.012 | -0.071 |
| Monitor | -0.016 | -0.102 | -0.03 | -0.003 | -0.049 |
| Panorama | -0.006 | -0.078 | -0.039 | -0.012 | -0.062 |
| Plusminus | -0.016 | -0.058 | -0.037 | 0 | -0.064 |
| ProSieben | -0.013 | -0.028 | -0.004 | -0.002 | -0.028 |
| Report (BR) | -0.017 | -0.054 | -0.008 | -0.009 | -0.083 |
| Report (SWR) | -0.005 | -0.101 | -0.041 | -0.007 | -0.057 |
| Rh. Merkur | -0.018 | -0.019 | -0.004 | -0.005 | -0.038 |
| RTL Aktuell | -0.007 | -0.033 | -0.007 | -0.002 | -0.022 |
| Sat.1 News | -0.01 | -0.025 | -0.002 | -0.004 | -0.03 |
| Spiegel | -0.007 | -0.037 | -0.008 | -0.004 | -0.023 |
| Stern | -0.008 | -0.035 | -0.005 | -0.001 | -0.032 |
| Super Illu | -0.011 | -0.02 | -0.005 | -0.013 | -0.04 |
| SZ | -0.01 | -0.034 | -0.005 | -0.004 | -0.026 |
| Tagesschau | -0.006 | -0.034 | -0.008 | -0.003 | -0.019 |
| Tagesthemen | -0.008 | -0.039 | -0.011 | -0.003 | -0.023 |
| tageszeitung | -0.017 | -0.039 | -0.006 | -0.006 | -0.029 |
| WamS | -0.01 | -0.015 | -0.004 | -0.004 | -0.04 |
| WISO | -0.011 | -0.029 | -0.013 | 0.002 | -0.028 |
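
The grouping step that produces tables of this shape (one row per medium, one column per party group) could be sketched in base R as follows. Only the names medium, year and p_group come from the text; the long-format layout, the `value` column and the toy numbers are assumptions.

```r
# Hypothetical long-format data: one row per individual rating.
ratings <- data.frame(
  medium  = c("Bild", "Bild", "Bild", "Spiegel", "Spiegel", "Spiegel"),
  year    = c(2001, 2001, 2001, 2001, 2001, 2001),
  p_group = c("SPD", "SPD", "FDP", "SPD", "FDP", "FDP"),
  value   = c(-0.10, -0.06, -0.04, -0.07, -0.12, -0.08)
)

# Mean value per medium, year and party group.
means_long <- aggregate(value ~ medium + year + p_group,
                        data = ratings, FUN = mean)

# Spread the party groups into columns: one row per medium and year.
means_wide <- reshape(
  means_long,
  idvar     = c("medium", "year"),
  timevar   = "p_group",
  direction = "wide"
)

print(means_wide)
```

The wide result has one `value.<party>` column per party group, which is the layout k-means expects (observations in rows, variables in columns).
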
K-means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into k clusters, where k is the number of groups pre-specified by the analyst.
It classifies objects into multiple clusters, where each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.
K-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized.
The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:
\[ W(C_k)=\sum_{x_i\in C_k}(x_i-\mu_k)^2 \]

where \(x_i\) is a data point belonging to cluster \(C_k\) and \(\mu_k\) is the mean of the points assigned to \(C_k\). Each observation \(x_i\) is assigned to a cluster such that the sum of squared (SS) distances between the observations and their assigned cluster centers \(\mu_k\) is minimized.
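
To make the definition of \(W(C_k)\) concrete, the sketch below computes it by hand for each cluster and compares the result with the `withinss` component returned by `kmeans()`. The data are random toy values, and the choice of three clusters is arbitrary.

```r
set.seed(1)
# Toy data: 30 observations on two standardized variables (illustrative only).
x <- scale(matrix(rnorm(60), ncol = 2))

km <- kmeans(x, centers = 3, nstart = 10)

# W(C_k): sum of squared Euclidean distances of the members of
# cluster k to that cluster's mean vector mu_k.
within_ss <- sapply(seq_len(3), function(k) {
  members <- x[km$cluster == k, , drop = FALSE]
  mu_k <- colMeans(members)
  sum(sweep(members, 2, mu_k)^2)
})

# The hand-computed values should agree with what kmeans() reports.
print(all.equal(within_ss, unname(km$withinss)))
```
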
The objective function to be minimized is the total within-cluster sum of squares:

\[ \text{tot.withinss} = \sum_{k=1}^{K}W(C_k)=\sum_{k=1}^{K}\sum_{x_i\in C_k}(x_i-\mu_k)^2 \]

### K-means Algorithm
The k-means algorithm can be summarized as follows:

1. Specify the number of clusters (k) to be created (chosen by the analyst).
2. Randomly select k objects from the data set as the initial cluster centers or means.
3. Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.
4. For each of the k clusters, update the cluster centroid by calculating the new mean values of all the data points in the cluster. The centroid of the \(k\)th cluster is a vector of length \(p\) containing the means of all variables for the observations in that cluster, where \(p\) is the number of variables.
5. Iteratively minimize the total within-cluster sum of squares (equation above). That is, repeat steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached.

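
The steps above can be sketched directly in R. Note that this is a plain Lloyd-style iteration that mirrors the listed steps, not the Hartigan-Wong variant that `kmeans()` uses by default. `simple_kmeans` is a hypothetical helper written for illustration, and the toy data are made up.

```r
# Minimal Lloyd-style k-means following the steps listed above.
# `simple_kmeans` is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, max_iter = 10) {
  x <- as.matrix(x)
  # Step 2: pick k random observations as the initial centers.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 3: squared Euclidean distance of every point to every center,
    # then assign each point to the nearest one.
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # Step 5: stop once the assignments no longer change.
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster
    # Step 4: recompute each centroid as the mean of its members.
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  # Total within-cluster sum of squares for the final assignment.
  tot_withinss <- sum((x - centers[cluster, , drop = FALSE])^2)
  list(cluster = cluster, centers = centers, tot.withinss = tot_withinss)
}

set.seed(42)
# Toy data: two groups of 20 points each, centered near 0 and near 6.
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
print(table(fit$cluster))
```

Because the starting centers are random, a single run can end in a poor local optimum; in practice one would run several random starts and keep the solution with the smallest total within-cluster sum of squares, which is what the `nstart` argument of `kmeans()` does.
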
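The output of kmeans is a list with several components. The most important are:

- cluster: a vector of integers (from 1 to k) indicating the cluster to which each point is allocated.
- centers: a matrix of cluster centers.
- totss: the total sum of squares.
- withinss: a vector of within-cluster sums of squares, one component per cluster.
- tot.withinss: the total within-cluster sum of squares, i.e. sum(withinss).
- betweenss: the between-cluster sum of squares, i.e. totss - tot.withinss.
- size: the number of points in each cluster.
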
If we print the results, we see that our grouping produced 3 clusters of sizes 29, 49, and 287. We also see the cluster centers (means) for the three groups across the four variables (Bündnis 90/ Die Grüne, CDU/CSU, FDP, SPD), and the cluster assignment for each observation (e.g., BamS was assigned to cluster 3 in 2001, Bericht aus Berlin was assigned to cluster 1 in 2005, etc.).
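
As a rough illustration of how such a result can be printed and inspected, the sketch below fits `kmeans()` to a random stand-in matrix. The object names, the toy values, and the choices `centers = 3` and `nstart = 25` are assumptions, so the sizes and centers it prints will not match the figures quoted above.

```r
set.seed(123)
# Toy stand-in for the scaled data: 25 observations, 4 variables (made up).
df <- scale(matrix(rnorm(100), ncol = 4,
                   dimnames = list(NULL, c("v1", "v2", "v3", "v4"))))

km <- kmeans(df, centers = 3, nstart = 25)

print(km)                 # sizes, centers, cluster vector, sums of squares
print(km$size)            # number of observations per cluster
print(km$centers)         # one row of variable means per cluster
print(head(km$cluster))   # cluster assignment of the first observations
```
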
We can also view our results by using fviz_cluster, which provides a nice illustration of the clusters. If there are more than two dimensions (variables), fviz_cluster performs a principal component analysis (PCA) and plots the data points according to the first two principal components, i.e. the two components that capture the largest share of the variance.
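
The idea behind that plot can be approximated in base R: project the data onto the first two principal components with `prcomp()` and colour the points by cluster. This is a sketch on made-up data, not a reproduction of what fviz_cluster draws; the commented call at the end assumes the factoextra package is installed.

```r
set.seed(123)
# Toy data: 30 observations on four standardized variables (illustrative only).
df <- scale(matrix(rnorm(120), ncol = 4))
km <- kmeans(df, centers = 3, nstart = 25)

# Project the observations onto the first two principal components.
pca <- prcomp(df)
scores <- pca$x[, 1:2]

# Share of the total variance captured by each component.
explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(explained[1:2], 3))

# Base-graphics scatter plot coloured by cluster membership.
plot(scores, col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "k-means clusters in PCA space")

# With factoextra installed, the packaged equivalent would be:
# factoextra::fviz_cluster(km, data = df)
```

Checking `explained` is worthwhile, since the first two components do not necessarily account for most of the variance when the variables are only weakly correlated.
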